
Poster: aibek Date: Nov 12, 2012 4:08am
Forum: forums Subject: All archive.org PDF and DjVu files are searchable

All native archive.org PDF and DjVu files are searchable, even if you cannot select the text from them. Try it!

That is, all non-opensource archive.org PDF and DjVu files have an OCR layer in them.

You can extract the OCR layer using simple tools. So, if you have a PDF or DjVu file, you do not need to download the Full Text separately -- the Full Text is exactly the same as the OCR layer already present in the files!
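
A rough sketch of the extraction step follows. The post does not name the tools, so this assumes the `djvutxt` program (from DjVuLibre) and the `pdftotext` program (from Poppler) are installed; the function names and file names are made up for illustration.

```python
import subprocess
from pathlib import Path

# Assumption: djvutxt (DjVuLibre) and pdftotext (Poppler) are on PATH.
# The post does not say which tools the author actually used.
EXTRACTORS = {".djvu": "djvutxt", ".pdf": "pdftotext"}


def build_extract_command(source, target):
    """Return the argv list that dumps the text layer of `source` into `target`."""
    suffix = Path(source).suffix.lower()
    if suffix not in EXTRACTORS:
        raise ValueError(f"unsupported file type: {suffix!r}")
    # Both tools accept the form: TOOL input-file output-file
    return [EXTRACTORS[suffix], str(source), str(target)]


def extract_text(source, target):
    """Run the extractor; raises CalledProcessError if the tool fails."""
    subprocess.run(build_extract_command(source, target), check=True)
```

For example, `extract_text("book.djvu", "book.txt")` would write the OCR layer of a downloaded DjVu file to `book.txt`, provided `djvutxt` is available.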

I did not know this for a long time, so I assume some others may not know it either.

Poster: garthus1 Date: Nov 12, 2012 8:59am
Forum: forums Subject: Re: All archive.org PDF and DjVu files are searchable

Aibek,

I think this was done some time ago; it is a useful feature.

Gerry

Poster: aibek Date: Dec 1, 2012 8:22am
Forum: forums Subject: Re: All archive.org PDF and DjVu files are searchable

Apparently not too long ago, for the FAQ has still not been updated:

IDENTIFIER_djvu.xml this is an xml version of the OCR output which has the word positions (as a bounding box). this is used for building the djvu file, and is used for searching the flip books, and may be [used for] constructing a searchable pdf in the future.
http://archive.org/about/faqs.php#140
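
As an illustration of what "word positions (as a bounding box)" could look like in practice, here is a minimal sketch that reads words and their coordinates from such an XML file. The sample fragment, the `WORD` element name and the `coords` attribute are assumptions about the file layout, not something stated in the FAQ; check a real IDENTIFIER_djvu.xml before relying on them.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment shaped like an IDENTIFIER_djvu.xml file. The element
# and attribute names (WORD, coords) are an assumption, not taken from the FAQ.
SAMPLE = """<DjVuXML><BODY><OBJECT><HIDDENTEXT><PAGECOLUMN><REGION><PARAGRAPH>
<LINE>
<WORD coords="100,250,180,200">Hello</WORD>
<WORD coords="200,250,300,200">world</WORD>
</LINE>
</PARAGRAPH></REGION></PAGECOLUMN></HIDDENTEXT></OBJECT></BODY></DjVuXML>"""


def words_with_boxes(xml_text):
    """Return (word, [coordinates]) pairs for every WORD element."""
    root = ET.fromstring(xml_text)
    result = []
    for word in root.iter("WORD"):
        # coords is a comma-separated list of integers (the bounding box)
        coords = [int(c) for c in word.get("coords", "").split(",") if c]
        result.append((word.text or "", coords))
    return result


def plain_text(xml_text):
    """Join the words in document order, discarding the positions."""
    return " ".join(word for word, _ in words_with_boxes(xml_text))
```

Dropping the coordinates, as `plain_text` does, gives back ordinary running text; keeping them is what makes highlighting search hits on the page image possible.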

Poster: aibek Date: Dec 5, 2012 7:11pm
Forum: forums Subject: Re: All archive.org PDF and DjVu files are searchable

For Google Books (uploaded by ‘tpb’) the PDF file is not searchable, but the DjVu file is.

This is true for the books I checked. It may be that the later Google Books PDFs are searchable too; at any rate it is certain that all their corresponding DjVu files (produced by IA) are searchable.

Here are two books acquired in 2007 whose PDFs are not searchable but whose DjVu files are:

http://archive.org/details/anintroductiont00irelgoog
http://archive.org/details/indischesagen02holtgoog

Poster: aibek Date: Dec 7, 2012 6:28pm
Forum: forums Subject: Re: All archive.org PDF and DjVu files are searchable

And, therefore:

All DjVu files are searchable (even the opensource ones, but check the exceptions below).

To sum up:

All native IA files are searchable and, furthermore, all DjVu files are searchable (native or not).

-----------------
Technical details
-----------------

Here is how the files are constructed:

1) If images are uploaded -- in the order and format specified in the about/FAQs.php page -- searchable PDF and DjVu files are created out of them. This is how the native IA books are formed.

2(a) If a PDF containing images † is uploaded (like the Google Books’ PDFs), a searchable DjVu file is created out of it. Lately, a searchable PDF (“Text PDF”) is created out of it as well.

2(b) If a text-PDF † is uploaded, a searchable DjVu file is created out of it.

3) If a DjVu file is uploaded, searchable or not, it is not used in any way. (No “derivation”.) Note: I am NOT sure of this step, but the link suggests this: http://archive.org/details/Encyclopedia_Britannica_1911_Complete
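
The rules above can be summarized as a small lookup table. This is only a restatement of the list in code form; the labels for upload kinds and derived files are illustrative names, not identifiers used by archive.org, and case 3 carries the same uncertainty as in the text.

```python
# Labels are made up for this sketch; they are not archive.org identifiers.
DERIVED = {
    # 1) page images uploaded in the prescribed order and format
    "images": ["searchable PDF", "searchable DjVu"],
    # 2(a) PDF containing images; the Text PDF is produced only lately
    "image-pdf": ["searchable DjVu", "searchable Text PDF"],
    # 2(b) text-PDF
    "text-pdf": ["searchable DjVu"],
    # 3) DjVu upload: apparently no derivation (the author is not sure)
    "djvu": [],
}


def derived_files(upload_kind):
    """Return the searchable files derived from a given kind of upload."""
    if upload_kind not in DERIVED:
        raise ValueError(f"unknown upload kind: {upload_kind!r}")
    return list(DERIVED[upload_kind])
```

For instance, `derived_files("text-pdf")` lists only the DjVu file, matching case 2(b).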


† A text-PDF is one in which what you see is text: the cursor becomes a bar over it, and you can select it.

In a “PDF containing images” what you see is not text, but images. (Thus the size of such files is much larger than the corresponding text-PDF.) Such PDFs can be made searchable by adding a “layer” of text obtained by OCR-ing the images. (And thus the search is not always accurate.)

A text-PDF is formed out of the corresponding “PDF containing images” by first OCR-ing it, and then trying to match everything -- format, fonts, and non-text images -- exactly. I suppose that when the software is in doubt it retains the images, and thus the size of this file too is often many MBs, while text-PDFs containing no images are very small. (Such PDFs would be no larger than compressed text; i.e., their size should be about the same as that of the corresponding EPUB files.)

For more information check this: http://blog.nitropdf.com/2008/09/text-works-ocrd-scanned-pdf-files/



Poster: Nemo_bis Date: Dec 15, 2012 3:39pm
Forum: forums Subject: Re: All archive.org PDF and DjVu files are searchable

Yes, and the DjVu text layer is very useful. Wikisource uses it a lot and would like to use it even more: https://www.mediawiki.org/wiki/Requests_for_comment/CAPTCHA#A_homegrown_reCAPTCHA_clone

Poster: aibek Date: Jan 5, 2013 8:04pm
Forum: forums Subject: Re: All archive.org PDF and DjVu files are searchable

More information on IA PDFs is available here:

http://archive.org/post/464880/pdfs-on-amazon-kindle